What are Sankey Plots?

  • Flow diagram
  • Width of flows = amount
  • Named after Matthew Henry Phineas Riall Sankey
  • Flow and distribution of heat in steam engines, 1898

https://upload.wikimedia.org/wikipedia/commons/1/10/JIE_Sankey_V5_Fig1.png

History

  • First and most famous Sankey plot: Charles Minard's Map of Napoleon's Russian Campaign of 1812
  • Sankey diagram on map
  • Created 1869
  • Encodes many variables at once: army size, position, direction, temperature, dates

https://upload.wikimedia.org/wikipedia/commons/2/29/Minard.png

How are they used?

  • Energy: input, output, waste
  • International Energy Agency: Flow of energy for the entire planet from 1973 to 2019
  • Interactive: Show change of energy flow for selected country and years

https://www.iea.org/sankey/#?c=World&s=Balance

How are they used?

  • Eurostat: Interactive energy balance flow for EU or countries of EU from 1990 to 2020

https://ec.europa.eu/eurostat/web/energy/energy-flow-diagrams

How are they used?

  • Vote flows in elections

https://www.tagesschau.de/inland/btw21/waehlerwanderung-bundestagswahl-103.html

How are they used?

  • Vote flows in elections

https://www.economist.com/graphic-detail/2019/11/01/a-british-election-and-other-uncertainties

Wrap Up

  • Show flow of different categories between two or more steps
    • Start: Initial distribution of categories (usually the left side)
    • Flow: Redistribution between steps (usually from left to right)
    • End: New distribution (usually the right side)
  • Distributions can be calculated from flow data
  • Width of a line represents the volume or amount
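
A minimal sketch of how the start and end distributions follow from flow data. The party names and numbers are made up for illustration; only the summing pattern matters.

```python
from collections import defaultdict

# Hypothetical vote flows between two elections: (source, target, amount).
flows = [
    ("Party A", "Party A", 50),
    ("Party A", "Party B", 10),
    ("Party B", "Party A", 5),
    ("Party B", "Party B", 35),
]

start = defaultdict(int)  # distribution before the flow (left side)
end = defaultdict(int)    # distribution after the flow (right side)
for source, target, amount in flows:
    start[source] += amount  # everything leaving a category
    end[target] += amount    # everything arriving at a category

print(dict(start))  # {'Party A': 60, 'Party B': 40}
print(dict(end))    # {'Party A': 55, 'Party B': 45}
```

The reverse does not work: the two distributions alone do not determine the flows between them.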

Python libraries for Sankey plots

Data

https://download.statistik-berlin-brandenburg.de/0c8e82331bc2327a/802f7f020114/SB_A01-03-00_2020j01_BE.xlsx

In [2]:
data = {
    '2020': {
        # inputs
        'start': 3669491,  # census at start of the year
        'births': 38693,
        'immigration': 142923,
        # outputs
        'deaths': -37642,
        'emigration': -144881,
        'error': -4496,
        'end': -3664088  # census at the end of the year
    }
}

flows = list(data['2020'].values())
labels = list(data['2020'].keys())
flows, labels
Out[2]:
([3669491, 38693, 142923, -37642, -144881, -4496, -3664088],
 ['start', 'births', 'immigration', 'deaths', 'emigration', 'error', 'end'])
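
A quick sanity check before plotting, sketched with the numbers from the cell above: inputs are positive, outputs are negative, and a balanced diagram should have a net flow of zero.

```python
# Signed flows copied from the cell above: inputs > 0, outputs < 0.
flows = [3669491, 38693, 142923, -37642, -144881, -4496, -3664088]

inputs = sum(f for f in flows if f > 0)
outputs = sum(f for f in flows if f < 0)

print(inputs, outputs, inputs + outputs)  # 3851107 -3851107 0
```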

matplotlib

In [3]:
import matplotlib.pyplot as plt
from matplotlib.sankey import Sankey
In [4]:
sankey = Sankey()  # init
sankey.add(flows=flows, labels=labels)  # add flow(s)
sankey.finish()  # create
plt.show()  # show
In [5]:
scale = 0.0000001
sankey = Sankey(scale=scale)  # init with scale!
sankey.add(flows=flows, labels=labels)
sankey.finish()
plt.show()
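
The scale does not have to be guessed. The matplotlib documentation suggests choosing it so that scale times the sum of the inputs is roughly 1.0. A sketch of that rule of thumb; the hand-picked 1e-7 above is smaller than this and still gives a usable layout.

```python
# Signed flows for 2020, as defined earlier.
flows = [3669491, 38693, 142923, -37642, -144881, -4496, -3664088]

total_in = sum(f for f in flows if f > 0)

# Rule of thumb: scale * (sum of inputs) should be about 1.0.
scale = 1 / total_in

print(scale)                       # roughly 2.6e-07
print(round(1e-7 * total_in, 4))   # the hand-picked scale yields 0.3851
```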
In [6]:
sankey = Sankey(scale=scale)

# 0 (inputs from the left, outputs to the right),
# 1 (from and to the top) or -1 (from and to the bottom).
orientations = [0, -1, 1, -1, 1, -1, 0]

# add flow(s) with orientations
sankey.add(flows=flows, labels=labels, orientations=orientations)
sankey.finish()
plt.show()
In [7]:
pathlengths=[0.1, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]

sankey = Sankey(scale=scale)
sankey.add(
    flows=flows, labels=labels,
    orientations=orientations,
    pathlengths=pathlengths,
)  # add flow(s) with orientations and pathlengths
sankey.finish()
plt.show()
In [8]:
def format_number(n):
    return '{:,}'.format(abs(n))  # add thousand separator

# add number format
sankey = Sankey(scale=scale, format=format_number)
sankey.add(
    flows=flows, labels=labels,
    orientations=orientations,
    pathlengths=pathlengths,
)
sankey.finish()
plt.show()
In [9]:
sankey = Sankey(scale=scale, format=format_number)
sankey.add(
    flows=flows, labels=labels,
    orientations=orientations,
    pathlengths=pathlengths,
    facecolor='lightgray'  # change color
)
sankey.finish()
plt.title("Berlin Census 2020")  # add title
plt.show()
In [10]:
# add second year
data = {
    '2019': {
        'start 2019': 3644826,
        'births': 39503,
        'immigration': 184744,
        'deaths': -34739,
        'emigration': -161513,
        'error': -3330,
        'end 2019': -3669491
    },
    '2020': {
        'start 2020': 3669491,
        'births': 38693,
        'immigration': 142923,
        'deaths': -37642,
        'emigration': -144881,
        'error': -4496,
        'end 2020': -3664088
    }
}
In [11]:
flows_2019 = list(data['2019'].values())
labels_2019 = list(data['2019'].keys())
labels_2019[-1] = None  # remove last label
flows_2020 = list(data['2020'].values())
labels_2020 = list(data['2020'].keys())
In [12]:
pathlengths=[0.3, 0.3, 0.1, 0.1, 0.3, 0.5, 0.3]  # new pathlengths

sankey = Sankey(scale=scale, format=format_number)
sankey.add(  # add 2019
    flows=flows_2019, labels=labels_2019,
    orientations=orientations,
    pathlengths=pathlengths,
    facecolor='lightgray'
)
sankey.add(  # add 2020
    flows=flows_2020, labels=labels_2020,
    orientations=orientations,
    pathlengths=pathlengths,
    prior=0, connect=(len(flows_2019)-1, 0),  # connect second flow to first
    facecolor='darkgray'
)
sankey.finish()
plt.title("Berlin Census 2019 & 2020")  # add title
plt.show()
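
connect=(i, j) joins flow i of the prior diagram to flow j of the new one. The two flows need to be equal and opposite; to my understanding matplotlib refuses the connection otherwise. A small check for the two years used above:

```python
flows_2019 = [3644826, 39503, 184744, -34739, -161513, -3330, -3669491]
flows_2020 = [3669491, 38693, 142923, -37642, -144881, -4496, -3664088]

# Last flow of 2019 (its end population) feeds the first flow of 2020.
connect = (len(flows_2019) - 1, 0)
out_flow = flows_2019[connect[0]]
in_flow = flows_2020[connect[1]]

print(out_flow, in_flow, out_flow + in_flow == 0)  # -3669491 3669491 True
```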

matplotlib notes

pySankey

In [13]:
import pandas as pd
from pySankey.sankey import sankey
In [14]:
# create DataFrame from 2020 data
df_2020 = pd.DataFrame([
    # start -> deaths
    {'source': 'start', 'target': 'deaths', 'value': 37642},
    # start -> emigration
    {'source': 'start', 'target': 'emigration', 'value': 144881},
    # start -> error
    {'source': 'start', 'target': 'error', 'value': 4496},
    # start -> end
    {'source': 'start', 'target': 'end', 'value': 3669491},
    # births -> end
    {'source': 'births', 'target': 'end', 'value': 38693},
    # immigration -> end
    {'source': 'immigration', 'target': 'end', 'value': 142923},
])
df_2020
Out[14]:
source target value
0 start deaths 37642
1 start emigration 144881
2 start error 4496
3 start end 3669491
4 births end 38693
5 immigration end 142923
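
In the table above, the start -> end link carries the full start population, so the links into 'end' add up to more than the reported end-of-year figure. A variant, sketched under the assumption that this link should carry only the residents who neither died, left, nor were booked as error:

```python
# Signed flows for 2020: inputs > 0, outputs < 0.
flows = {
    'start': 3669491, 'births': 38693, 'immigration': 142923,
    'deaths': -37642, 'emigration': -144881, 'error': -4496,
    'end': -3664088,
}

inputs = {k: v for k, v in flows.items() if v > 0 and k != 'start'}
outputs = {k: -v for k, v in flows.items() if v < 0 and k != 'end'}

# Assumption: whoever did not leave via an output stays until year end.
stayed = flows['start'] - sum(outputs.values())

links = [('start', name, value) for name, value in outputs.items()]
links.append(('start', 'end', stayed))
links += [(name, 'end', value) for name, value in inputs.items()]

into_end = sum(value for _, target, value in links if target == 'end')
print(stayed, into_end)  # 3482472 3664088
```

With this variant the links into 'end' sum to the end population from the source data.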
In [15]:
sankey(
    left=df_2020['source'], right=df_2020['target'],
    leftWeight=df_2020['value'],
    fontsize=14,
    #figure_name="Berlin Census 2020",  # used for saving png, not title
)

pySankey notes

  • https://github.com/anazalea/pySankey
  • documentation only in the code (docstrings)
  • examples do not work after pip install
  • figure_name not in docstring: used for saving the file (not as title)
  • needs some trial and error
  • only one flow step possible
  • looks more like a modern Sankey plot than matplotlib
  • great for simple plots

psankey

In [16]:
from psankey.sankey import sankey
In [17]:
nodes, fig, ax = sankey(
    df_2020, aspect_ratio=4/3,
    nodelabels=True, linklabels=True, labelsize=5,
)
plt.title("Berlin Census 2020")  # add title
plt.show()
In [18]:
# create DataFrame from 2019 & 2020 data
df = pd.DataFrame([
    # 2019
    {'source': '2019', 'target': '2020', 'value': 3644826},
    {'source': '2019', 'target': 'deaths `19', 'value': 34739},
    {'source': '2019', 'target': 'emigration `19', 'value': 161513},
    {'source': '2019', 'target': 'error `19', 'value': 3330},
    {'source': 'births `19', 'target': '2020', 'value': 39503},
    {'source': 'immigration `19', 'target': '2020', 'value': 184744},
    # 2020
    {'source': '2020', 'target': '2021', 'value': 3669491},
    {'source': '2020', 'target': 'deaths `20', 'value': 37642},
    {'source': '2020', 'target': 'emigration `20', 'value': 144881},
    {'source': '2020', 'target': 'error `20', 'value': 4496},
    {'source': 'births `20', 'target': '2021', 'value': 38693},
    {'source': 'immigration `20', 'target': '2021', 'value': 142923},
])
df.head(3)
Out[18]:
source target value
0 2019 2020 3644826
1 2019 deaths `19 34739
2 2019 emigration `19 161513
In [19]:
nodes, fig, ax = sankey(
    df, aspect_ratio=4/3,
    nodelabels=True, linklabels=True, labelsize=5,
)
plt.title("Berlin Census 2019 & 2020")  # add title
plt.show()

psankey notes

  • https://github.com/mandalsubhajit/psankey
  • works with pd.DataFrames
  • works for multiple flow steps
  • short but helpful documentation in README.md
  • some smart options
    • nodemodifier to highlight nodes
  • node positions not customizable

holoviews

In [20]:
import holoviews as hv
from holoviews import opts, dim
hv.extension('bokeh')
width, height = 600, 400
In [21]:
# run example code
sankey = hv.Sankey([
    ['A', 'X', 5], ['A', 'Y', 7], ['A', 'Z', 6],
    ['B', 'X', 2], ['B', 'Y', 9], ['B', 'Z', 4]
])
sankey.opts(width=width, height=height)
Out[21]:
In [22]:
df_2020.head()
Out[22]:
source target value
0 start deaths 37642
1 start emigration 144881
2 start error 4496
3 start end 3669491
4 births end 38693
In [23]:
# pass DataFrame from previous example
sankey = hv.Sankey(df_2020)
sankey.opts(width=width, height=height)
Out[23]:
In [24]:
df.head()
Out[24]:
source target value
0 2019 2020 3644826
1 2019 deaths `19 34739
2 2019 emigration `19 161513
3 2019 error `19 3330
4 births `19 2020 39503
In [25]:
sankey = hv.Sankey(df)
sankey.opts(width=width, height=height, cmap='Set2',
            edge_color=dim('source').str(),
            node_color=dim('target').str())
Out[25]:

holoviews notes

plotly

In [26]:
import plotly.graph_objects as go
In [27]:
# example from https://plotly.com/python/sankey-diagram/
fig = go.Figure(data=[go.Sankey(
    node = dict(
      pad = 15,
      thickness = 20,
      line = dict(color="black", width=0.5),
      label = ["A1", "A2", "B1", "B2", "C1", "C2"],
      color = "blue"
    ),
    link = dict(
      # indices correspond to labels, eg A1, A2, A1, B1, ...
      source = [0, 1, 0, 2, 3, 3],
      target = [2, 3, 3, 4, 4, 5],
      value = [8, 4, 2, 8, 4, 2]
  ))])
In [28]:
fig.update_layout(
    title_text="Basic Sankey Diagram",
    width=width, height=height, font_size=10)
fig.show()
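
The source and target lists hold positions in the label list, not names. A small sketch that decodes the example links back into label pairs, which makes the index convention easier to check by eye:

```python
# Lists copied from the plotly example above.
label = ["A1", "A2", "B1", "B2", "C1", "C2"]
source = [0, 1, 0, 2, 3, 3]
target = [2, 3, 3, 4, 4, 5]
value = [8, 4, 2, 8, 4, 2]

# Look up each index in the label list to get readable links.
links = [(label[s], label[t], v) for s, t, v in zip(source, target, value)]
for src, dst, v in links:
    print(f"{src} -> {dst}: {v}")
# A1 -> B1: 8
# A2 -> B2: 4
# A1 -> B2: 2
# B1 -> C1: 8
# B2 -> C1: 4
# B2 -> C2: 2
```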
In [29]:
# create DataFrame from 2019 & 2020 data
s, t, v, c = 'source', 'target', 'value', 'color'
df = pd.DataFrame([
    # 2019
    {s: '2019', t: '2020', v: 3644826, c: 'lightgray'},
    {s: '2019', t: 'deaths `19', v: 34739, c: '#a6cee3'},
    {s: '2019', t: 'emigration `19', v: 161513, c: '#1f78b4'},
    {s: '2019', t: 'error `19', v: 3330, c: '#f1b6da'},
    {s: 'births `19', t: '2020', v: 39503, c: '#b2df8a'},
    {s: 'immigration `19', t: '2020', v: 184744, c: '#33a02c'},
    # 2020
    {s: '2020', t: '2021', v: 3669491, c: 'lightgray'},
    {s: '2020', t: 'deaths `20', v: 37642, c: '#a6cee3'},
    {s: '2020', t: 'emigration `20', v: 144881, c: '#1f78b4'},
    {s: '2020', t: 'error `20', v: 4496, c: '#f1b6da'},
    {s: 'births `20', t: '2021', v: 38693, c: '#b2df8a'},
    {s: 'immigration `20', t: '2021', v: 142923, c: '#33a02c'},
])
df.head(3)
Out[29]:
source target value color
0 2019 2020 3644826 lightgray
1 2019 deaths `19 34739 #a6cee3
2 2019 emigration `19 161513 #1f78b4
In [30]:
# create nodes with index from DataFrame
# https://stackoverflow.com/a/69464558
import numpy as np
nodes = np.unique(df[["source", "target"]], axis=None)
nodes = pd.Series(index=nodes, data=range(len(nodes)))
nodes
Out[30]:
2019                0
2020                1
2021                2
births `19          3
births `20          4
deaths `19          5
deaths `20          6
emigration `19      7
emigration `20      8
error `19           9
error `20          10
immigration `19    11
immigration `20    12
dtype: int64
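
The same name-to-index lookup can be sketched without numpy, here on a shortened, illustrative link list. Sorting the unique names gives the same alphabetical order as np.unique, which matters because any per-node list (labels, x, y, colors) has to follow this order.

```python
# Shortened link list for illustration: (source, target).
links = [
    ('2019', '2020'),
    ('2019', 'deaths `19'),
    ('births `19', '2020'),
]

# Unique node names in sorted order, then name -> position.
names = sorted({name for link in links for name in link})
index = {name: i for i, name in enumerate(names)}

source = [index[s] for s, _ in links]
target = [index[t] for _, t in links]

print(names)           # ['2019', '2020', 'births `19', 'deaths `19']
print(source, target)  # [0, 0, 2] [1, 3, 1]
```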
In [31]:
fig = go.Figure(
    data=[
        go.Sankey(
        node={
            "label": nodes.index,
        },
        link={
            "source": nodes.loc[df["source"]],
            "target": nodes.loc[df["target"]],
            "value": df["value"],
        })
    ]
)
In [32]:
fig.update_layout(
    title_text="Berlin Census 2019 & 2020",
    width=width, height=height, font_size=10)
fig.show()
In [33]:
# create x, y and colors for the nodes
x = [.1, .4, .7,  # years
     .1, .4,  # births
     .3, .6,  # deaths
     .3, .6,  # emigration
     .3, .6,  # error
     .1, .4,  # immigration
]
y = [.5, .5, .5,  # years
     .75, .8,  # births
     .2, .25,  # deaths
     .25, .3,  # emigration
     .15, .2,  # error
     .7, .75,  # immigration
]
color = ["darkgray", "darkgray", "darkgray",
         "#b2df8a", "#b2df8a",  # light green
         "#a6cee3", "#a6cee3",  # light blue
         "#1f78b4", "#1f78b4",  # dark blue
         '#f1b6da', '#f1b6da', # light pink
         "#33a02c", "#33a02c",  # dark green
]
x, y
Out[33]:
([0.1, 0.4, 0.7, 0.1, 0.4, 0.3, 0.6, 0.3, 0.6, 0.3, 0.6, 0.1, 0.4],
 [0.5, 0.5, 0.5, 0.75, 0.8, 0.2, 0.25, 0.25, 0.3, 0.15, 0.2, 0.7, 0.75])
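
Positional lists like x and y are easy to get out of step with the node order. One possible safeguard, sketched on a shortened node list with a hypothetical position dict: key the coordinates by node name and derive the lists from the node order.

```python
# Shortened node list for illustration, in the same sorted order as above.
nodes = ['2019', '2020', 'births `19', 'deaths `19']

# Hypothetical helper structure: node name -> (x, y) in plot coordinates.
position = {
    '2019': (0.1, 0.5),
    '2020': (0.4, 0.5),
    'births `19': (0.1, 0.75),
    'deaths `19': (0.3, 0.2),
}

# Fail early if a node has no position.
missing = [name for name in nodes if name not in position]
assert not missing, f"no position for: {missing}"

x = [position[name][0] for name in nodes]
y = [position[name][1] for name in nodes]

print(x)  # [0.1, 0.4, 0.1, 0.3]
print(y)  # [0.5, 0.5, 0.75, 0.2]
```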
In [34]:
fig = go.Figure(
    data=[
        go.Sankey(
        arrangement = "freeform",
        node={
            "label": nodes.index,
            "x": x,
            "y": y,
            "pad": 100,  # padding between nodes,
            "color": color,
        },
        link={
            "source": nodes.loc[df["source"]],
            "target": nodes.loc[df["target"]],
            "value": df["value"],
            "color": df["color"],
        })
    ]
)
In [35]:
fig.update_layout(
    title_text="Berlin Census 2019 & 2020", font_size=10
)
fig.show()

plotly notes

Takeaways

  • What do you want to show?
    • Are you interested in the flows or distributions?
  • Check whether your data fits a Sankey plot
    • distributions alone are not enough
    • you need data for the flows
    • not too many categories
  • Make sure it is worth the effort
    • Get data into the correct format
    • Customizing the plot
  • Think about colors!
  • Search for good examples to guide you

Do

https://upload.wikimedia.org/wikipedia/commons/2/29/Minard.png

Do

https://www.economist.com/graphic-detail/2019/11/01/a-british-election-and-other-uncertainties

Do

https://www.ipoint-systems.com/blog/from-data-to-knowledge-the-power-of-elegant-sankey-diagrams/

Don't

https://www.sankey-diagrams.com/intra-eu-horse-meat-trade/

Don't

https://www.sankey-diagrams.com/how-not-to-sankey/